Extracting Hidden Information from Knowledge Networks 
We develop a method allowing us to reconstruct individual tastes of
customers from a sparsely connected network of their opinions on
products, services, or each other. Two distinct phase transitions
occur as the density of edges in this network is increased: above the
first, macroscopic prediction of tastes becomes possible, while above
the second, all unknown opinions can be uniquely reconstructed. We
illustrate our ideas using a simple Gaussian model, which we study
using both field-theoretical methods and numerical simulations. We
point out a potential relevance of our approach to the field of
bioinformatics.
Mainstream economics maintains that human tastes reflected in consumer
preferences are sovereign, i.e., not subject to discussion or study.
It postulates that a consumer's choice of products or services is the
outcome of a complete and thorough optimization among all possible
options, and, therefore, his/her satisfaction cannot be further
improved. Such a doctrine, though often challenged from both within
and outside of economics, is still dominant. Recently, however, many
business practitioners have started to exploit the affinity in
people's tastes in order to predict their personal preferences and
come up with individually tailored recommendations. Our basic premise
is that people's consumption patterns are not based on a complete
optimization over all possible choices. Instead, they constitute just
a small revealed part of the vast pool of "hidden wants". These hidden
wants, if properly exploited, can lead to much better matches between
people and products, services, or other people. In the economy of the
past such opportunities were hardly exploitable. Things have changed
in the course of the current information revolution, which has both
connected people on an unprecedented scale and allowed for easy
collection of vast amounts of information on customers' preferences.
In just a few years the internet has already changed much of our
traditional perceptions about human interactions, both commercial and
social. We believe that technical advances in wireless and other
network interfaces will soon make it possible to capture the necessary
information virtually for free and to put this theory to use.

Our aim is to predict yet unknown individual consumer preferences,
based on the pattern of their correlations with already known ones.
Predictive power obviously depends on the ratio between the known and
yet unknown parts. When the fraction of known opinions $p$ is too
small, only occasional predictions are possible. When it surpasses the
first threshold, which we refer to as $p_1$, almost all unobserved
preferences acquire some degree of predictability. Finally, for $p$
above the second, higher threshold $p_2$, all these unobserved
preferences can be uniquely reconstructed. In what follows we describe
a simple model of how customers' opinions are formed and spell out in
some detail basic algorithms allowing for their prediction.

To make this discussion somewhat less abstract, let us consider a
matchmaker or advisor service of the kind that already exists on many
book-selling websites, which personally recommends new books to each
customer. In order for such recommendations to be successful one needs
to assume the existence of some "hidden metrics" in the space of
readers' tastes and books' features. In other words, matchmaking is
possible only if the opinions of two people with similar tastes on two
books with similar features are usually not too far from each other.
In this work we use the simplest realization of this hidden metrics.
We assume that each reader is characterized by an $M$-dimensional
array $\mathbf{r} = (r^{(1)}, r^{(2)}, \ldots, r^{(M)})$ of his/her
tastes in books, while each book has the corresponding list of $M$
basic "features" $\mathbf{b} = (b^{(1)}, b^{(2)}, \ldots, b^{(M)})$.
An opinion of a reader on a book is given simply by the overlap
(scalar product) $\Omega$ of the reader's vector of tastes and the
book's vector of features:
$\Omega = \mathbf{r} \cdot \mathbf{b} = \sum_{\alpha=1}^{M} r^{(\alpha)} b^{(\alpha)}$.
The matchmaker has some incomplete knowledge about the opinions of its
customers on the books they have read, and it uses this knowledge to
reconstruct yet unknown opinions (overlaps) and to recommend books to
its customers.
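
As a minimal illustration of this definition, the sketch below computes
the opinion of one reader on two books for $M = 3$. The function name
`opinion` and all numerical values are hypothetical and chosen only for
illustration; they do not come from any data set.

```python
def opinion(tastes, features):
    """Overlap (scalar product) of a reader's tastes and a book's features."""
    if len(tastes) != len(features):
        raise ValueError("taste and feature vectors must share the dimension M")
    return sum(t * f for t, f in zip(tastes, features))

# Hypothetical M = 3 example (illustrative numbers only).
reader = [1.0, -0.5, 2.0]
book_a = [0.5, 1.0, 1.0]     # features mostly aligned with the reader's tastes
book_b = [-1.0, 0.0, -0.5]   # features mostly opposed to the reader's tastes

print(opinion(reader, book_a))   # positive overlap: a good match
print(opinion(reader, book_b))   # negative overlap: a poor match
```

A large positive overlap corresponds to a book the matchmaker would
recommend, a negative one to a book it would not.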

The central position of our matchmaker with respect to its customers
makes its services dramatically different from those of the so-called
"smart agents", whose goal is to anticipate and predict the tastes of
their individual owners. Indeed, the scope of recommendations of a
smart agent is severely limited by the fact that each of them serves
its own master, so that others would not cooperate. Our matchmaker, on
the other hand, is a completely neutral player in an economic game,
able to synergistically use the knowledge collected by all
players/agents to everybody's advantage (including its own).

The information about who-read-what is best visualized as a bipartite
undirected graph in which vertices corresponding to readers are
connected by edges to vertices corresponding to the books each of them
has read and reported an opinion on to the matchmaker. Similar graphs
(or networks) have recently been drawn to the center of attention of
the statistical physics community under the name "small world
networks". For example, the statistical properties of a bipartite
graph of movie actors connected to the films they appeared in have
been studied, as have those of a graph of scientists and the papers
they co-authored. In this paper we go beyond empirical studies or
simple growth models of such graphs. The new feature making the graphs
introduced in this work richer than ordinary undirected graphs is that
each vertex has a set of $M$ "hidden" internal degrees of freedom.
Consequently, each edge carries a real number $\Omega$ representing
the similarity or overlap between these internal degrees of freedom on
the two vertices it connects. In our case this number quantifies the
matchmaker's knowledge of the opinion that a given customer has on a
given product. Therefore, we refer to such graphs as knowledge or
opinion networks.

In the most general case any two vertices in the knowledge network can
be connected by an edge. This is realized, for instance, if vectors
$\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N$ stand for strings
of individual "interests" in a group of $N$ people. The overlap
$\Omega_{ij} = \mathbf{r}_i \cdot \mathbf{r}_j$ measures the
similarity of interests for a given pair of people and can be thought
of as the "quality of the match" between them. The matchmaker's goal
is to analyze this information and to recommend to a customer $i$
another customer $j$, whom he/she has not met yet and who is likely to
have a large positive overlap with his/her set of interests. Mutual
opinions can be conveniently stored in an $N \times N$ symmetric
matrix of scalar products $\Omega$. In the above case any element of
this matrix can in principle be "reported" to the matchmaker.
Different restrictions imposed on this most general scenario describe
other versions of our basic model, such as: 1) An advisor service
recommending $N_b$ products to $N_r$ customers (e.g., our model of
books and readers from the introduction). In this case the square
matrix $\Omega$ has $N_r + N_b$ rows and columns, while all entries
known to the matchmaker are restricted to the $N_r \times N_b$
rectangle corresponding to opinions of customers on products. 2) A
real matchmaking service recommending $N_m$ men and $N_w$ women to
each other. Here we assume that each man and woman can be
characterized by two $M$-dimensional vectors: the first one,
$\mathbf{q}$, is the vector of his/her own "qualities", while the
second one, $\mathbf{d}$, represents the set of his/her "desires",
i.e., the ideal qualities that he/she is seeking in a partner. The
opinion of a person $i$ on a person $j$ is then given by the scalar
product $\mathbf{d}_i \cdot \mathbf{q}_j$, while the opposite opinion
has in general a completely different value
$\mathbf{d}_j \cdot \mathbf{q}_i$. The full
$(2N_m + 2N_w) \times (2N_m + 2N_w)$ overlap matrix is still
symmetric, but only two small sectors, containing $N_m \times N_w$
elements each, are accessible to the matchmaker.

With a small modification this last scenario can be applied to a
completely different problem, namely that of physical interactions
within a set of biological molecules such as proteins. It is known
that the high specificity of such interactions is achieved by virtue
of the "key-and-lock" matching of features on their surfaces. Given
the space of possible shapes of locks and keys, each molecule $i$ can
be described by two vectors $\mathbf{k}_i$ and $\mathbf{l}_i$ of 0's
and 1's, which determine which keys and locks are present on its
surface. Provided that the key $k^{(\alpha)}$ uniquely fits the lock
$l^{(\alpha)}$, the strength of the interaction between two molecules
$i$ and $j$ is determined by
$\Omega_{ij} = \mathbf{k}_i \cdot \mathbf{l}_j + \mathbf{k}_j \cdot \mathbf{l}_i$.
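
The key-and-lock overlap can be made concrete with a small sketch. The
function name `interaction` and the two toy molecules below are
hypothetical; the only ingredient taken from the text is the formula
$\Omega_{ij} = \mathbf{k}_i \cdot \mathbf{l}_j + \mathbf{k}_j \cdot \mathbf{l}_i$
for binary key and lock vectors.

```python
def dot(a, b):
    """Scalar product of two equal-length 0/1 vectors."""
    return sum(x * y for x, y in zip(a, b))

def interaction(keys_i, locks_i, keys_j, locks_j):
    """Interaction strength k_i . l_j + k_j . l_i between molecules i and j."""
    return dot(keys_i, locks_j) + dot(keys_j, locks_i)

# Hypothetical molecules in a shape space with 4 key/lock types.
keys_a, locks_a = [1, 0, 1, 0], [0, 1, 0, 0]
keys_b, locks_b = [0, 1, 0, 0], [1, 0, 0, 1]

strength_ab = interaction(keys_a, locks_a, keys_b, locks_b)
strength_ba = interaction(keys_b, locks_b, keys_a, locks_a)
print(strength_ab, strength_ba)   # the interaction is symmetric in i and j
```

Here key 0 of molecule A fits lock 0 of molecule B, and key 1 of B fits
lock 1 of A, so two key-lock pairs contribute to the interaction.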

In the rest of the paper we concentrate only on the most general
non-bipartite case of an $N \times N$ matrix of overlaps of interests
in a group of $N$ customers and leave other, more restricted
situations for future work. The matchmaker always has only partial and
noisy information about the matrix $\Omega$ due to several factors:
1) First and most importantly, the matchmaker knows only some of the
opinions $\Omega_{ij}$ of its customers on each other, which it uses
to guess the rest. 2) In real life the overlap can never be precisely
measured. In the simplest case of an extremely narrow information
channel, customers report to the matchmaker only the sign of their
overlap with other customers. One can also imagine a somewhat wider
channel, where the matchmaker asks its customers to rate their
satisfaction on a graded scale, the finer the better. 3) The loss of
information due to a narrow channel between the matchmaker and its
customers can be further complicated by random noise in reporting,
which would inevitably be present in real-life situations. Indeed, we
are far from assuming that the scalar product of tastes and features
completely determines customer satisfaction with a product, or that
similarity of interests is all that matters when two people form an
opinion about each other. One should always leave room for an
idiosyncratic reaction, which does not result from any logical
weighting of features. Our hope is that strong, mutually reinforcing
correlations due to the redundancy of information stored in an
idealized matrix $\Omega$ would manifest themselves in a large enough
group of customers even when they are masked by a substantial amount
of idiosyncratic noise. In principle all three sources of noise and
partial information are present simultaneously. However, in this work
we treat them separately and restrict ourselves to the case where the
matchmaker knows the exact values of all overlaps reported to it.

It is easy to see how correlations between matrix elements allow the
matchmaker to succeed in its goal of predicting yet unknown overlaps.
For example, the known values of
$\Omega_{12} = \mathbf{r}_1 \cdot \mathbf{r}_2$ and
$\Omega_{23} = \mathbf{r}_2 \cdot \mathbf{r}_3$ somewhat restrict the
possible mutual orientation of vectors $\mathbf{r}_1$ and
$\mathbf{r}_3$ and, therefore, contain information about the value of
the yet unknown overlap $\Omega_{13}$. Below we will demonstrate that
the predictability of an overlap between two points that are already
connected by a chain of known overlaps of length $L$ is proportional
to $M^{-(L-1)/2}$ and, therefore, decays exponentially with $L$ for
$M > 1$. Hence, appreciable prediction becomes possible only when two
points are connected by exponentially many mutually reinforcing paths.
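
The correlation behind this argument can be checked numerically. The
sketch below assumes the Gaussian model discussed later in the text
(independent unit-variance components) and estimates the three-point
loop correlation $\langle \Omega_{12}\Omega_{23}\Omega_{31} \rangle$,
which in that model equals $M$, together with the open-chain average
$\langle \Omega_{12}\Omega_{23} \rangle$, which vanishes. The value of
$M$, the sample size, and the seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
M, samples = 4, 200_000

# Three independent taste vectors per sample, unit-variance Gaussian components.
r1, r2, r3 = rng.standard_normal((3, samples, M))

omega12 = np.sum(r1 * r2, axis=1)
omega23 = np.sum(r2 * r3, axis=1)
omega31 = np.sum(r3 * r1, axis=1)

loop = np.mean(omega12 * omega23 * omega31)   # closed loop of length 3
chain = np.mean(omega12 * omega23)            # open chain, no loop

print(f"loop  <O12 O23 O31> = {loop:.2f} (model value M = {M})")
print(f"chain <O12 O23>     = {chain:.2f} (model value 0)")
```

Only the closed loop carries a nonzero correlation, which is why a new
overlap becomes predictable only once it closes a loop of known ones.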

The amount of information collected by the matchmaker on its customers
can be conveniently characterized by either the number $K$ or the
density $p = 2K/[N(N-1)]$ of known overlaps among all $N(N-1)/2$
off-diagonal elements of the matrix. For very small $K$ all edges of
the knowledge network are disconnected and no prediction is possible.
As more and more edges are randomly added to the network, the chance
that a new edge would join two previously connected points, i.e., the
probability of forming a loop in the network, increases. It is exactly
in this situation that the matchmaker has some predictive power about
the value of the new overlap before it is observed. However, this
excess information disappears in the thermodynamic limit
$N \to \infty$ until the density of edges reaches the first threshold
$p_1 = 1/(N-1)$. This threshold is nothing else but a percolation
transition, above which the Giant Connected Component (GCC) appears in
a random graph. For $p > p_1$ the fraction of nodes in the GCC rapidly
grows, exponentially approaching 100%. This means that already for a
moderate ratio $p/p_1$ almost every new edge added to the graph would
join two previously connected points. This transition would also
manifest itself in the behavior of the entropy of the joint
probability distribution of unknown overlaps.
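
The percolation threshold can be illustrated with a short simulation.
The sketch below is an assumption-laden toy: it places
$K = (p/p_1)\,N/2$ edges between uniformly chosen pairs of nodes
(repeated edges are allowed and self-loops are skipped, which
approximates a random graph for large $N$) and measures the fraction of
nodes in the largest connected component with a union-find structure.
The function name `giant_fraction` and the values of $N$ and $p/p_1$
are illustrative choices.

```python
import random

def giant_fraction(n, mean_degree, rng):
    """Fraction of nodes in the largest component; mean_degree = p / p_1."""
    parent = list(range(n))
    size = [1] * n

    def find(x):
        # Find the component root, compressing the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _ in range(round(mean_degree * n / 2)):
        a, b = rng.randrange(n), rng.randrange(n)
        if a == b:
            continue                      # skip self-loops
        ra, rb = find(a), find(b)
        if ra != rb:                      # merge the smaller tree into the larger
            if size[ra] < size[rb]:
                ra, rb = rb, ra
            parent[rb] = ra
            size[ra] += size[rb]
    return max(size[find(v)] for v in range(n)) / n

rng = random.Random(0)
N = 2000
below = giant_fraction(N, 0.5, rng)   # p / p_1 = 0.5, below the threshold
above = giant_fraction(N, 3.0, rng)   # p / p_1 = 3, above the threshold
print(f"p/p1 = 0.5: largest component holds {below:.3f} of the nodes")
print(f"p/p1 = 3.0: largest component holds {above:.3f} of the nodes")
```

Below the threshold the largest component is a vanishing fraction of
the graph, while at $p/p_1 = 3$ it already contains most of the nodes.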

One has to remember, though, that the predictive power of the
matchmaker is exponentially small for long loops. This means that
while the typical diameter of the graph is still large, the loop
correlation is too weak to significantly bias most of the unknown
overlaps. Reliable prediction becomes possible only for much higher
values of $p$. Let us calculate $p_2$, the point of the second phase
transition, above which the values of all unknown overlaps are
completely determined by the information contained in the known ones.
In geometrical language, at this point the knowledge network undergoes
a "rigidity percolation" phase transition, at which the relative
orientations of vectors become fixed. Such a transition is possible
only for $N > M$, since only in this case does $\Omega$ contain
redundant information about the components of all vectors
$\mathbf{r}_i$. The position of the second phase transition $p_2$ can
be determined by carefully counting the degrees of freedom. For
$N > M$ the overlap matrix $\Omega$ has very special spectral
properties: it has precisely $N - M$ zero eigenvalues, while the
remaining $M$ eigenvalues are strictly positive. An easy way to
demonstrate this is to recall that the overlap matrix can be written
as $\Omega = R R^{\dagger}$, where $R$ is the $N \times M$ rectangular
matrix formed by the vectors, $R_{i\alpha} = r_i^{(\alpha)}$. The
Singular Value Decomposition (SVD) technique allows one to
"diagonalize" $R$ (for $N > M$), that is, to find an $M \times M$
orthogonal matrix $V$ ($V V^{\dagger} = 1$), an $M \times M$ positive
diagonal matrix $D$, and an $N \times M$ matrix $U$ formed by $M$
orthonormal $N$-dimensional vectors, such that $R = U D V$. Now it is
easy to see that $\Omega = U D^2 U^{\dagger}$ has precisely $M$
positive eigenvalues, equal to the squares of the elements of the
diagonal matrix $D$, and $N - M$ zero eigenvalues. The number of
degrees of freedom of $\Omega$ is equal to the $NM$ degrees of freedom
of $R$ minus the $M(M-1)/2$ "gauge" degrees of freedom of the
orthogonal matrix $V$, which have no influence on the elements of
$\Omega$. Once the number of known elements $K$ exceeds the total
number of degrees of freedom of $\Omega$, the remaining unknown
elements can in principle be reconstructed. Therefore, the second
phase transition happens at

$$p_2 = \frac{M(2N - M + 1)}{N(N - 1)} \simeq \frac{2M}{N}. \qquad (1)$$

Here the $\simeq$ sign corresponds to the limit $N \gg M$.
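
Both the spectral property and Eq. (1) are easy to check numerically.
The sketch below assumes Gaussian random taste vectors and uses the
parameters $M = 9$, $N = 50$ quoted in the caption of Fig. 1; the helper
name `p2` and the relative tolerance used to decide which eigenvalues
count as zero are illustrative choices.

```python
import numpy as np

def p2(N, M):
    """Second threshold of Eq. (1): density at which K equals the number of
    degrees of freedom N*M - M*(M-1)/2 of the overlap matrix."""
    return M * (2 * N - M + 1) / (N * (N - 1))

rng = np.random.default_rng(1)
N, M = 50, 9
R = rng.standard_normal((N, M))       # rows are the hidden vectors r_i
omega = R @ R.T                       # N x N overlap matrix

eigenvalues = np.linalg.eigvalsh(omega)
n_positive = int(np.sum(eigenvalues > 1e-9 * eigenvalues.max()))

print(f"positive eigenvalues: {n_positive}, zero eigenvalues: {N - n_positive}")
print(f"p_2 = {p2(N, M):.3f}, large-N estimate 2M/N = {2 * M / N:.3f}")
```

For these parameters the exact expression gives $p_2 \approx 0.34$,
while the estimate $2M/N = 0.36$ is already close even though $N/M$ is
only about 5.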

In practice, however, in order to calculate the set of unknown
overlaps one needs to solve a system of nonlinear equations with a
huge number of unknown variables, which is a daunting task. To this
end we came up with a simple and efficient iterative numerical
algorithm that uses the special spectral properties of $\Omega$:
(1) Construct the initial approximation $\Omega_a$ to $\Omega$ by
substituting 0 for all its unknown elements; (2) Diagonalize
$\Omega_a$ and construct the matrix $\Omega'_a$ by keeping the $M$
largest (positive) eigenvalues and the corresponding eigenvectors of
$\Omega_a$, while setting the remaining $N - M$ eigenvalues to zero;
(3) Construct the new, refined approximate matrix $\Omega_a$ by
copying all unknown elements from $\Omega'_a$, while resetting the
rest to their exactly known values; (4) Go to step (2). As shown in
Fig. 1, for $p > p_2$ $\Omega_a$ converges to $\Omega$ exponentially
fast in the number of iterations $n$. Numerical simulations also
indicate that the rate of this exponential convergence scales as
$(p - p_2)^2$ above the second phase transition (see the inset in
Fig. 1).
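
The four steps above can be sketched in a few lines of NumPy. This is
an illustrative reading of the procedure rather than the code used for
Fig. 1: the function name `reconstruct`, the fixed iteration count, the
treatment of the diagonal elements as unknown, and the test parameters
($N = 40$, $M = 2$, $p = 0.6$, well above $p_2 \approx 0.10$) are all
assumptions made here.

```python
import numpy as np

def reconstruct(omega_known, known_mask, M, n_iter=300):
    """Iteratively complete a symmetric overlap matrix of rank M."""
    # Step (1): start with zeros in place of all unknown elements.
    approx = np.where(known_mask, omega_known, 0.0)
    for _ in range(n_iter):
        # Step (2): keep the M largest eigenvalues (eigh sorts ascending).
        values, vectors = np.linalg.eigh(approx)
        top_values = np.clip(values[-M:], 0.0, None)
        top_vectors = vectors[:, -M:]
        projected = (top_vectors * top_values) @ top_vectors.T
        # Step (3): keep known elements exact, take the rest from the projection.
        approx = np.where(known_mask, omega_known, projected)
        # Step (4): repeat from step (2).
    return approx

rng = np.random.default_rng(0)
N, M, p = 40, 2, 0.6
R = rng.standard_normal((N, M))
omega = R @ R.T

# Random symmetric mask of known off-diagonal elements; diagonal left unknown.
upper = np.triu(rng.random((N, N)) < p, k=1)
known = upper | upper.T

estimate = reconstruct(omega, known, M)
error_start = np.abs(omega[~known]).mean()               # error of the zero guess
error_final = np.abs(estimate - omega)[~known].mean()    # error after iterating
print(f"mean error on unknown elements: {error_start:.3f} -> {error_final:.2e}")
```

A fixed number of iterations keeps the sketch short; a practical
version would instead stop once successive approximations differ by
less than a chosen tolerance.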

Below $p_2$ this algorithm performs rather poorly, and the error may
even grow with the number of iteration steps. This is to be expected,
since in this region there is more than one solution for $\Omega$
consistent with the set of constraints imposed by the $K$ known matrix
elements. While our iterative algorithm always converges to one of
these solutions, barring an unlikely accident this solution is far
from the set of "true" values of the unknown matrix elements. In this
situation the best thing that a matchmaker can do is to calculate the
average value $\langle \Omega_{pq} \rangle$ of each unknown element in
the ensemble of all matrices consistent with the given set of $K$
constraints. We have succeeded in estimating
$\langle \Omega_{pq} \rangle$ analytically. This calculation involves
rather heavy algebra and will be reported elsewhere.

In the above discussion the parameter $M$ was treated as a fixed and
known property of the system. However, in real life one usually does
not know a priori the number of relevant components of an idealized
vector of tastes or features. Here we want to propose a criterion for
choosing it optimally. If the number of known overlaps $K$ is small,
it would be useless to try to model the matrix using a
high-dimensional space of tastes. Indeed, all the free play allowed by
a large $M$ would not give the matchmaker much predictive power
anyway. This leads us to the conjecture that the optimal way for a
matchmaker to select an effective number of internal degrees of
freedom $M_{\mathrm{eff}}$ is to do it in such a way that the system
is balanced precisely at or near the critical threshold $p_2$. In
other words, given $K$ and $N$ one should solve the equation
$N M_{\mathrm{eff}} - M_{\mathrm{eff}}(M_{\mathrm{eff}} - 1)/2 = K$ to
find
$M_{\mathrm{eff}} = [N + 1/2 - \sqrt{(N + 1/2)^2 - 2K}] \simeq K/N$.
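
As a worked example of this criterion, the sketch below evaluates the
closed-form root of the quadratic equation. The function name
`effective_dimension` is hypothetical, and the sample values $N = 50$,
$K = 414$ are chosen here because $K = 414$ is exactly the number of
degrees of freedom $NM - M(M-1)/2$ for $M = 9$.

```python
import math

def effective_dimension(N, K):
    """Smaller root of N*M - M*(M - 1)/2 = K, i.e. the conjectured M_eff."""
    discriminant = (N + 0.5) ** 2 - 2 * K
    if discriminant < 0:
        raise ValueError("K exceeds the number of matrix elements for this N")
    return N + 0.5 - math.sqrt(discriminant)

N, K = 50, 414
m_eff = effective_dimension(N, K)
print(f"M_eff = {m_eff:.3f}, small-K estimate K/N = {K / N:.3f}")
```

For these values the exact root recovers $M_{\mathrm{eff}} = 9$, while
the estimate $K/N \approx 8.3$ is slightly lower because $N$ is not much
larger than $M$.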

Finally, we introduce a particularly simple, analytically tractable
example of a knowledge network, where each component $r_i^{(\alpha)}$
of a hidden vector is independently drawn from a normal distribution.
The joint probability distribution $P(\Omega)$ of all $N(N+1)/2$
elements of the (symmetric) overlap matrix $\Omega$ is then given by a
multidimensional integral

$$P(\Omega) = \int \prod_{i,\alpha} \frac{dr_i^{(\alpha)}}{\sqrt{2\pi}} \exp\Big(-\sum_{i,\alpha} \frac{(r_i^{(\alpha)})^2}{2}\Big) \prod_{i \le j} \delta\Big(\Omega_{ij} - \sum_{\alpha=1}^{M} r_i^{(\alpha)} r_j^{(\alpha)}\Big).$$

Using the standard integral representation for the $\delta$-function,
$\delta(x) = \int \exp(i\lambda x)\, d\lambda/(2\pi)$, and calculating
exactly the integral over $r_i^{(\alpha)}$, which is now quadratic,
one arrives at a remarkably elegant and compact expression:

$$P(\Omega) = \int \prod_{i \le j} \frac{d\lambda_{ij}}{2\pi} \exp\Big(i \sum_{i \le j} \lambda_{ij} \Omega_{ij}\Big) \det(\hat{1} + i\hat{\Lambda})^{-M/2}.$$

The matrix $\hat{1}$ is the $N \times N$ unity matrix, while
$\hat{\Lambda}$ is a symmetric matrix with elements $2\lambda_{ii}$ on
the diagonal and $\lambda_{ij}$ off the diagonal. This expression has
the form of a multidimensional Fourier transform, so that
$\Phi(\hat{\Lambda}) = \det(\hat{1} + i\hat{\Lambda})^{-M/2}$ is
nothing else but the generating function of the joint probability
distribution $P(\Omega)$. As usual, the Taylor expansion of the
generating function in powers of $\lambda_{ij}$ around
$\hat{\Lambda} = 0$ allows one to calculate any imaginable correlation
between integer powers of $\Omega_{ij}$. It is more convenient to work
with irreducible correlations, generated by the Taylor expansion of
$\phi(\hat{\Lambda}) = \ln \Phi(\hat{\Lambda}) = -(M/2) \ln \det(\hat{1} + i\hat{\Lambda}) = -(M/2)\, \mathrm{Tr}[\ln(\hat{1} + i\hat{\Lambda})]$.
A surprising consequence of the above exact expression for
$\phi(\hat{\Lambda})$ is that all irreducible correlations of matrix
elements are proportional to $M$. In particular, the expansion
$\phi(\hat{\Lambda}) = (M/2) \sum_{L=1}^{\infty} \mathrm{Tr}[(-i\hat{\Lambda})^L]/L$
allows one to calculate any correlation of the type
$\langle\langle \Omega_{i_1 i_2} \Omega_{i_2 i_3} \cdots \Omega_{i_{L-1} i_L} \Omega_{i_L i_1} \rangle\rangle = M$,
corresponding to a given non-self-intersecting loop on the network.
The presence of such cyclic correlations indicates that the signs of
matrix elements are weakly correlated. Taking into account that each
$|\Omega_{ij}| \sim \sqrt{M}$ and using scaling arguments, it is
straightforward to demonstrate that the predictability of one of the
overlaps in a loop of length $L$ based on the knowledge of the others
scales as $M^{-(L-1)/2}$.

In this letter we have described a general framework allowing one to
predict elements from the unobserved part of a knowledge network based
on the observed part. The predictive power was shown to depend
strongly on the ratio between these two parts. While our original
motivation was to model a commercial matchmaking service in the
internet age, the implications go well beyond that. We would like to
point out that our general framework, developed for knowledge
networks, could also be of much importance in the field of
bioinformatics, where cross-correlations, mutual interactions, and
functions of large sets of biological entities such as proteins, DNA
binding sites, etc., are only partially known. It is conceivable that
a similar approach applied to, e.g., a large matrix of protein-protein
interactions would prove to be fruitful.

We have benefitted from conversations with T. Hwa, M. Marsili,
C. Tang, Y. Yu and A. Zee. Work at Brookhaven National Laboratory was
carried out under Contract No. DE-AC02-98CH10886, Division of Material
Science, U.S. Department of Energy. This work was sponsored in part by
the Swiss National Foundation under Grant 20-61470.00.
FIG. 1. The average error in the value of unknown matrix elements of
$\Omega$ as a function of the number of iterations. All
$r_i^{(\alpha)}$ are independent Gaussian random numbers. The
parameters of the model are $M = 9$, $N = 50$, corresponding to
$p_2 = 0.34$. The inset shows the scaling of the exponential
convergence rate as a function of $p - 0.34$. The solid line has
slope 2.